ABSTRACT 

Recent research in cognitive psychology has drawn attention to the important role 
that students' personal understandings and representations of subject matter play in the 
learning process. This paper briefly reviews some of this research, and contrasts the kind 
of learning that results in an individual's changed conception or view of a phenomenon 
with the more passive, additive kind of learning assessed by most traditional achievement 
tests. To be consistent with a view of learning as an active, constructive process, 
educational tests are required which focus on key concepts in an area of learning, and 
which take into account the variety of types and levels of understanding that students have 
of those concepts. In these tests, scoring responses right and wrong is likely to be less 
appropriate than using students' answers to infer their levels of understanding. This will 
require not only imaginative new types of test items, but statistical models that permit 
inferences about students' understandings once their responses have been observed. 
Psychometric approaches are sketched to construct measures of achievement from such 
tests. 

1. INTRODUCTION 

Implicit in much of our current measurement theory and practice is a view of 
learners as passive absorbers of provided wisdom. Most items on standard achievement 
tests assess students' abilities to recall and apply facts and routines presented during 
instruction. Some require only the memorization of detail; they seek evidence that students 
have absorbed factual details presented in class and are able to reproduce these on 
command. Other achievement test items, although supposed to assess higher-level learning 
outcomes like "comprehension" and "application", often require little more than the ability 
to recall a formula (e.g., $s = v_0 t + \tfrac{1}{2} a t^2$) and to make appropriate substitutions to arrive at 
a correct answer. 

Test items of this type are consistent with a view of learning as a passive, receptive 
process through which new facts and skills are added to a learner's repertoire in much the 
same way as bricks might progressively be added to a wall. The process is additive and 
incremental: students with the highest levels of achievement in an area are those who have 
absorbed and can reproduce the greatest numbers of facts and formulae. The practice of 
scoring answers to items of this type either "right" or "wrong" is consistent with the view 
that individual units of knowledge or skill are either present or absent in a learner at the time 
of testing. Under this approach, diagnosis is a simple matter of identifying unexpected 
holes or gaps in a student's store of knowledge. These are subareas of learning in which 
knowledge is "missing" and in which there is a need for remedial teaching to fill a deficit. 

This approach to the measurement of achievement may be appropriate for some 
forms of learning — as when the learner's task is in fact to master a body of factual material. 
In recent decades, however, significant advances have occurred in our understanding of the 
ways in which students learn. In particular, there has been an increased awareness of the 
active, constructive nature of most forms of human learning and of the important role that 
students' personal conceptions and representations of subject matter play in the learning 
process. Rather than being a passive process of absorbing new material as it is 
encountered, meaningful learning is increasingly being recognized as an active process 
through which students construct their own interpretations, approaches, and ways of 
viewing phenomena, and through which learners relate new information to their existing 
knowledge and understandings. Under this view of learning, the difference between 
beginning and advanced learners is seen not so much as a difference in amount of factual 
knowledge (although this is usually an important aspect of competent performance), as a 
difference in the types of conceptions and understandings that students bring to a problem, 
and in the strategies and approaches that they use. 

Support for this view of learning can be found in recent studies in a number of 
areas of investigation. In cognitive science, comparisons of novices and experts in various 
fields of learning show that expertise typically involves much more than mastery of a body 
of facts: experts and novices usually have very different ways of viewing phenomena and 
of representing and approaching problems in a field (e.g., Chi, Feltovich, & Glaser, 1981, 
in physics; Chase & Simon, 1973, in chess; Lesgold, Feltovich, Glaser, & Wang, 1981, in 
radiology; and Voss, Greene, Post, & Penner, 1983, in social science). Expert-novice 
studies suggest that the performances of beginning learners often can be understood in 
terms of the inappropriate or inefficient models that these learners have constructed for 
themselves. 

Similar observations have been made in the field of science education (see Driver & 
Easley, 1978; Osborne & Wittrock, 1983; Posner, Strike, Hewson, & Gertzog, 1982). 
Research into students' science learning has drawn attention to the frequent mismatch 
between intuitive understandings that students bring to the classroom and the conceptual 
frameworks assumed by teachers. Caramazza, McCloskey, and Green (1981) observe that 
the scientific "principles" that students abstract from everyday experience are often 
strikingly at variance with the most fundamental physical laws. These misunderstandings 
can go undetected by teachers if correct answers to test questions depend only on 
superficial knowledge of formulae and formula manipulation techniques (Clement, 1982). 

There is evidence that students can succeed in high school and even college science courses 
while still maintaining many of their misconceptions and without acquiring an 
understanding of underlying principles (White and Horwitz, 1987). 

Related work in Sweden (Marton, 1981; Entwistle and Marton, 1984; Dahlgren, 
1984; Saljo, 1984) has used clinical interviews to explore the different understandings that 
students have of key principles and phenomena in a number of fields of learning. These 
interviews have revealed a range of student conceptions of each of the phenomena that the 
studies have explored, and have illustrated the importance of forms of learning which 
produce "a qualitative change in a person's conception of a phenomenon" from a lower- 
level, more naive conception to a more expert understanding of that phenomenon 
(Johansson, Marton, & Svensson, 1985, 235). 

Under this view of learning, a student is rarely considered to have no understanding 
or no strategy when addressing a problem. Even beginning learners are considered to be 
engaged in an active search for meaning, constructing and using naive representations or 
models of subject matter. Rather than being "wrong", these representations frequently 
display partial understanding and are applied rationally and consistently by the individuals 
who use them. In arithmetic, for example, "it has been demonstrated repeatedly that 
novices who make mistakes do not make them at random, but rather operate in terms of 
meaning systems that they hold at a given time" (Nesher, 1986; also see Brown & Burton, 
1978). 

An implication of this view of learning for the assessment and monitoring of 
student learning is that much greater cognizance must be taken of the understandings and 
models that individual students construct for themselves during the learning process. In 
many areas of learning, levels of achievement might be better defined and measured not in 
terms of the number of facts and procedures that a student can reproduce, but in terms of 
his or her levels of understanding of the key concepts and principles that underlie a learning 
area (Glaser, 1981; Glaser, Lesgold, & Lajoie, 1987; Greeno, 1976). 

An example of a study that has investigated students' levels of understanding is 
Carpenter and Moser's (1984) study of children's arithmetic skills. Carpenter and Moser 
found that most children in the first to third grades of school are able to provide correct 
answers to single-digit addition questions like 6+8=?. But children have a variety of 
methods of answering questions of this kind (see Table 1). These different methods 
indicate different levels of understanding and proficiency in single-digit addition. Some 
children solve 6+8=? by counting out six objects and another eight objects, and then 
counting all 14 (category 1). Later, children reach an understanding that counting does not 
have to begin at the number one. They "count on", although not necessarily from the larger 
number (e.g., "6; 7,8,...,14"; category 2). Later still, children understand the commutative 
property of addition (6+8 = 8+6) and consistently count on from the larger number ("8; 
9, 10,..., 14"; category 3). Finally, by third grade, many children can solve 6+8=? using 
number facts, without counting objects (category 4). To monitor developing competence in 
single-digit addition, it is not sufficient to record only whether or not a child can provide 
the correct answer to a question like 6+8=?. By keeping track of the strategy that a child 
uses it is possible to infer the kinds of understanding that she or he has developed. 
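
The difference between recording correctness and recording strategy can be made concrete with a small sketch. The Python below is illustrative only: the strategy labels and the example observations are hypothetical, and the category numbers follow the description of Carpenter and Moser's categories given in this paper (category 0 is assumed to denote children who cannot solve the problem).

```python
from collections import Counter

# Ordered outcome categories for single-digit addition. The numbering follows
# the paper's description; the string labels are hypothetical names.
STRATEGY_LEVEL = {
    "no_solution": 0,           # cannot solve 6+8=?
    "count_all": 1,             # counts out 6, then 8, then all 14
    "count_on_from_first": 2,   # "6; 7, 8, ..., 14"
    "count_on_from_larger": 3,  # "8; 9, 10, ..., 14"
    "number_fact": 4,           # recalls 6+8=14 without counting
}


def record(answer_correct, strategy):
    """Store both the right/wrong score and the inferred level of understanding."""
    return {"correct": answer_correct, "level": STRATEGY_LEVEL[strategy]}


# Hypothetical observations of four children answering 6+8=?
observations = [
    record(True, "count_all"),
    record(True, "count_on_from_larger"),
    record(True, "number_fact"),
    record(False, "no_solution"),
]

# A right/wrong summary treats the three correct children as identical ...
n_correct = sum(obs["correct"] for obs in observations)

# ... whereas the level counts keep them apart.
level_counts = Counter(obs["level"] for obs in observations)
```

In this sketch three children are indistinguishable on the right/wrong record, yet each falls in a different ordered category.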



Insert Table 1 about here 

This paper considers the problem of constructing measures of achievement that are 
based not on tests of learners' abilities to recall facts and apply memorized routines, but on 
inferences about students' levels of understanding of key concepts in an area of learning. 
Particular attention is given to the requirements of an achievement testing methodology if it 
is to be consistent with a view of learning as an active, constructive process. 

2. CONVENTIONAL ACHIEVEMENT TESTING 

Techniques for constructing achievement tests have been developed and refined 
over many decades. Most achievement tests begin with a statement of the instructional 
objectives to be assessed by each test. According to Bloom, Hastings, and Madaus (1971, 
28), these objectives should be stated as directly observable student behaviors which can be 
reliably recorded as either present or absent. They should be "stated in terms which are 
operational, involving reliable observation and allowing no leeway in interpretation". To 
achieve this degree of reliability, test constructors are encouraged to write items to assess 
students' abilities to perform unambiguous, observable tasks such as "stating," "listing," 
"naming," "selecting," "recognizing," "matching," and "calculating" (Bloom et al., 1971, 
34). 

This emphasis on specifying and testing precise student behaviors has led to the 
construction of achievement tests composed of discrete items, each relating to a particular 
behavioral objective, and each scorable as either right or wrong. Multiple choice items 
have become especially popular in achievement tests because they can be scored quickly, 
unambiguously, and even by machine. In some areas of education, machine-scored 
multiple choice tests have become the principal mode of evaluating student learning. A 
disadvantage of conventional achievement tests is that, through their emphasis on precisely- 
defined student behaviors, they can encourage students to focus their efforts on relatively 
superficial forms of learning (Frederiksen, 1984). 

In parallel with these developments in the practice of educational measurement, 
psychometric methods have been developed for the analysis of students' performances on 
test items of this kind. These methods have been introduced to transform records of right 
and wrong answers into measures of achievement, and to evaluate the reliability and 
validity of these measures. The more complex analytical methods, based on item response 
theory (IRT), take into account not only differences in the difficulties of individual test 
items, but also differences in their discriminating powers and, in the case of multiple choice 
items, differences in their probabilities of being guessed correctly (Lord, 1980). Under 
IRT as well as under classical test theory, however, examinees' scores are essentially 
summaries of their tendencies to make correct rather than incorrect answers. 

The alternative to conventional achievement testing discussed in this paper begins 
with a consideration of the key concepts, principles and phenomena that underlie a course 
of instruction and around which factual learning can be organized. Rather than recording 
students' understandings of these concepts as simply "right" or "wrong", this alternative 
approach recognizes that learners have a variety of understandings of phenomena, and that 
some of these understandings are less complete than others. The purpose of assessment is 
not to establish the presence or absence of specific behaviors, but to infer the nature of 
students' understandings of particular phenomena. Consequently, systems of observation 
very different from collections of distinct and conceptually isolated multiple choice test 
items are required. 

3. BUILDING ACHIEVEMENT TESTS AROUND KEY CONCEPTS 

The construction of an achievement test usually begins with a table of specifications 
with subject matter on one axis, and types of learning outcomes on the other. Items are 
written to cover outcomes like "knowledge of terminology," "knowledge of specific facts," 
and "principles and generalizations." In the use of such a table, these outcomes are treated 
as different but equivalent: the aim is to write items to cover each. However, because of 
the requirement that items be based on observable behaviors that can be scored right or 
wrong, and because it is easier to write items to assess students' knowledge of facts and 
procedures than to assess their understandings of principles and generalizations, 
achievement tests tend to be tests of students' abilities to recall and apply factual 
knowledge. 

The method being proposed here begins by identifying key concepts in an area of 
instruction and building assessment procedures around these. These are fundamental 
principles, understandings, and approaches that a course of instruction aims to develop. 
The difference between this approach and the conventional practice of treating "knowledge 
of principles" as an instructional objective of much the same status as "knowledge of facts" 
or "knowledge of terms" is that this approach makes the assessment of conceptual 
understanding the primary focus of the testing procedure. 

A second fundamental difference between this approach and the usual approach to 
achievement testing is the emphasis placed on understanding how students view and think 
about key concepts. Rather than comparing students' responses with a "correct" answer, 
the emphasis is on inferring the nature or level of understanding reflected in each student's 
response. 

One area in which a great deal of work has been done to understand how students 
think about and approach phenomena is that of physics education. Studies in several 
countries have explored students' understandings of such concepts as acceleration 
(Trowbridge and McDermott, 1981), electric charge, enthalpy and entropy, force and 
motion (Viennot, 1979), gravitation (Champagne, Klopfer and Anderson, 1980; Gunstone 
and White, 1981), light and the transmission of heat, momentum, potential difference, 
proportionality, torque, and such principles and models as Newton's laws, conservation 
laws, the atomic model, and electron flow models for circuits. 

A common technique in these studies has been to ask students to describe what is 
happening in drawings of simple physical systems (e.g., to predict what will happen to an 
object, to describe the forces acting on a body, or to draw the trajectory that an object will 
follow). During these interviews, students are asked to explain their responses and their 
explanations are tape recorded (Johansson, Marton, & Svensson, 1985; McCloskey, 
1983). In other studies, students have been asked to manipulate an apparatus in a 
laboratory to achieve particular effects (e.g., to apply a force to make a body move in a 
particular direction), while their explanations and comments are tape recorded and later 
transcribed (McDermott, 1984). Still other researchers (e.g., diSessa, 1982; White, 1983) 
have developed interactive software for this purpose. In these studies, students are asked 
to apply "forces" to simulated objects on a screen to make them move to specified 
positions, to speed up, to slow down, and so on. 

An example of the kind of question posed in these studies, taken from the work of 
McDermott (1984), is shown in Figure 1. In this study, students were presented with a 
drawing of a pendulum and asked to draw the trajectory that the weight would follow if the 
string of the pendulum broke when it was midway through its swing (i.e., in the vertical 
position). Four of the trajectories commonly drawn by students are shown in Figure 1. 

Insert Figure 1 about here 



Drawings B, C and D are all incorrect, but they reflect different levels of 
understanding. Drawings B and C show some understanding that the object will continue 
moving to the right after the string breaks (Newton's first law). Students who draw 
trajectory D show no understanding of this and recognize gravity as the only influence on 
the object's trajectory. Drawing B is almost correct, but these students do not understand that 
the combination of a constant horizontal velocity and a constant vertical acceleration produces a 
parabolic trajectory. Drawing C shows the object continuing in the upward path that it 
would have followed had the string not broken, and then falling under the influence of 
gravity. This drawing suggests a naive "impetus" theory of motion, a commonly held 
belief that an object will continue in its path (even a curved path) after the removal of the 
force that kept it moving in that path, until the object's "impetus" dissipates. 
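
The parabolic form of the correct trajectory follows from a short calculation. As a sketch, assume (these symbols are introduced here for illustration and do not appear in the studies cited) that the weight leaves the string at the lowest point of the swing with horizontal speed $v$, that $g$ is the acceleration due to gravity, that air resistance is negligible, and that $x$ and $y$ are measured from the point of release with $y$ upward. Then

$$x(t) = v t, \qquad y(t) = -\tfrac{1}{2} g t^2 \quad\Longrightarrow\quad y = -\frac{g}{2 v^2}\, x^2 .$$

Eliminating $t$ gives a downward-opening parabola. Under the same assumptions, a purely vertical fall, as implied by a conception in which gravity is the only influence on the object, would correspond to the special case $v = 0$.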

The observations made in these studies suggest that students do not simply make 
"random errors" but operate in terms of naive theories about physical phenomena. In the 
area of force and motion, these theories can be "remarkably well-articulated, ... quite 
consistent across individuals, ... and strikingly inconsistent with the fundamental principles 
of classical mechanics" (McCloskey, 1983, 299). In his studies of students' attempts to 
control a simulated object on a screen, diSessa (1982, 38) found "a surprising structure of 
discrete and definite theories" about how forces influence motion. And, through their 
interviews with Swedish students about aspects of science learning, Johansson et al. 
(1985) arrive at a similar conclusion: 

In our case, a discovery of decisive importance was that for each 
phenomenon, principle, or aspect of reality, the understanding of which we 
studied, there seemed to exist a limited number of qualitatively different 
conceptions of that phenomenon, principle, or aspect of reality. (pp. 235-6) 

A number of researchers have observed that the same naive conceptions can be 
found among students of different ages and with different educational backgrounds. 
McCloskey (1983), for example, found the same types of naive physical theories among 
students who had never taken physics, high school physics students, and college physics 
students. The only difference was in the frequencies of occurrence of these different 
understandings. McDermott (1984) reports an identical observation in a Norwegian study 
of high school physics students, future high school science teachers, and physics 
graduates. 

A significant finding of these studies is that some students can succeed on 
traditional achievement tests and graduate from high school and even college physics 
courses with their naive conceptions of physical principles largely unchanged. Through 
their physics courses students are able to "master certain methods of calculation without 
having adopted the conceptualization underlying them" (Johansson et al., 1985, 235). 
Indeed, a misconception "may go undetected because a student's superficial knowledge of 
formulas and formula manipulation techniques can mask his or her misunderstanding of an 
underlying concept" (Clement, 1982, 66). The result is that "many students emerge from 
their study of physics and physical science without a functional understanding of some 
elementary but fundamental concepts" (McDermott, 1984, 31). 

These findings invite a reconsideration of the way in which we think about and 
attempt to measure science learning. Clearly, many students are succeeding on precise, 
operationally-defined objectives without developing an understanding of the material that 
they are learning. For many science educators, the answer is to place greater emphasis not 
on the learning of scientific facts and formulae, but on changing students' ways of thinking 
about scientific phenomena: 

The formal learning of science can be viewed as involving, at least in part, a 
shift from one set of beliefs about the physical world to another, one set of 
conceptions to another. (Osborne and Wittrock, 1985, 81). 

and 

In our view, learning (or the kind of learning we are primarily interested in) 
is a qualitative change in a person's conception of a certain phenomenon or 
of a certain aspect of reality. (Johansson et al., 1985, 235). 

4. CONSTRUCTING ORDERED OUTCOME CATEGORIES 

Having identified key concepts in an area of learning and devised contexts (items) 
through which students' understandings of these concepts can be investigated, the next task 
is to delineate a set of categories for each item, through which students' observed 
responses are related to unobservable states of understanding. In this section and the two 
following, we address applications in which the most prevalent states of understanding can 
be ordered. This notion of order is basic to a view of learning as a "shift" in a student's 
understanding, with a shift constituting the desired "learning" when the change is from a 
lower level, more naive understanding to a higher level, more expert conception of a 
phenomenon. 

This is not to say that all conceptions that students might bring to an item can be 
ordered from best to worst. We return later in the paper to consider some ways to model 
conceptions that differ but are not obviously more or less sophisticated. We begin here, 
however, by assuming the existence of a set of ordered categories for any given item (as 
will be illustrated below). For some items this set of categories might be constructed by 
grouping similarly sophisticated understandings. These constructed categories provide a 
conceptual framework for recording an individual's response, and introduce the possibility 
of basing measures of achievement on inferences about students' levels of understanding. 

Grouping students' responses to construct a set of categories of understanding is 
part of the method used by Marton (1981) and his colleagues at the University of 
Gothenburg. These researchers interview students to explore their understandings of 
particular concepts and principles, transcribe tape recordings of these interviews, and then 
carry out detailed analyses of transcripts. "The aim of the analysis is to yield descriptive 
categories representing qualitatively distinct conceptions of a phenomenon". These 
categories form an "outcome space" which provides "a kind of analytic map" of students' 
understandings of each phenomenon. Learning is thought of as "a shift from one 
conception to another" on this map (Dahlgren, 1984, 24-31). 

Carpenter and Moser (1984) provide a picture of such a map. From their analysis 
of students' performances on single-digit addition tasks, they constructed the five ordered 
outcome categories shown in Table 1. Children in category 0 are unable to solve 6+8=?. 
Children in category 1 understand that 6+8=? can be solved by counting the total number of 
objects in two groups of sizes 6 and 8. Children in category 2 also understand that the 
counting of objects does not have to begin at the number one, and so "count on." Children 
in category 3 understand the commutative property and count on from the larger number. 
Children in category 4 have a level of understanding that enables them to use number facts 
to solve 6+8=? without counting. 

Figure 2 shows the proportion of a group of Wisconsin children in each of the five 
outcome categories at each of eight time points during their first three years of school. At 
the beginning of first grade (bottom of the map), about 15 percent of these children could 
not solve problems like 6+8=?, even with blocks (Category 0). Among those children who 
could solve such a problem, by far the most common strategy was to count out six objects 
and another eight objects and then to count all fourteen (Category 1). By the beginning of 
the second grade, almost all these children understood that counting does not have to begin 
at the number one and were counting on (Categories 2 and 3), although some still did not 
understand the commutative property and were not counting consistently from the larger 
number. By the eighth round of observations (top of the map), more than 70 percent of 
this group of children could solve single-digit addition problems without having to count 
objects. Carpenter and Moser provide similar outcome maps for other aspects of addition 
and subtraction learning. 



Insert Table 1 and Figure 2 about here 



5. COLLECTING OBSERVATIONS 

While conversations with students are probably essential for identifying the variety 
of understandings that learners have of phenomena and for constructing sets of outcome 
categories, interviews are not practicable as a basis for achievement testing. Alternative 
observation methods must be found which will permit inferences to be made about 
students' levels of understanding. These procedures must go deeper than identifying 
incorrect answers: they must attempt to identify the nature of the understandings and 
models that individual students are employing. In general, this will require imaginative 
new approaches to achievement testing. 

One possible approach is the "rule assessment" procedure developed by Siegler 
(1978, 1981). This approach uses a carefully constructed set of questions designed to 
expose different levels of understanding of a concept. While each individual question 
might be scored as right or wrong, neither the response to any one item nor the total score on a 
set of items is sufficient to differentiate students using different rules. Rather, it is a 
student's pattern of right and wrong answers that constitutes a basis for inferring his or her 
level of understanding. 
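
The logic of inferring a rule from a response pattern, rather than from a total score, can be sketched as follows. This is a minimal illustration and not Siegler's procedure itself: the four rule names, the six items, and the predicted right/wrong patterns are all hypothetical, chosen so that two rules yield the same total score.

```python
# Hypothetical predicted response patterns (1 = right, 0 = wrong) on six items
# for students who consistently apply each of four rules.
PREDICTED = {
    "rule_A": (1, 1, 0, 0, 0, 0),
    "rule_B": (1, 1, 1, 0, 1, 0),
    "rule_C": (1, 1, 1, 1, 0, 0),
    "rule_D": (1, 1, 1, 1, 1, 1),
}


def infer_rule(responses):
    """Return the rule whose predicted pattern agrees with the most responses,
    together with the number of agreeing items. Ties go to the rule listed
    first in PREDICTED."""
    def agreement(rule):
        return sum(r == p for r, p in zip(responses, PREDICTED[rule]))

    best = max(PREDICTED, key=agreement)
    return best, agreement(best)


# Two hypothetical students with the same total score (4 out of 6) ...
student_1 = (1, 1, 1, 0, 1, 0)
student_2 = (1, 1, 1, 1, 0, 0)

# ... are assigned to different rules on the basis of their patterns.
inferred_1 = infer_rule(student_1)
inferred_2 = infer_rule(student_2)
```

Matching on agreement counts is only one possible criterion; a probabilistic treatment of departures from the predicted patterns would be needed in practice.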

Another approach is to use computer-administered tasks as the testing medium. 
This approach introduces the possibility of matching each student's response to a library of 
common responses rather than to a single "correct" answer. In the pendulum task in Figure 
1, for example, students might be asked to draw a trajectory on a screen and each student's 
drawing might then be referred to a library of common student responses. In this way, a 
student's response might automatically be assigned to one of several ordered outcome 
categories for that task, and a record made of the student's apparent conception or theory 
concerning that phenomenon. 

A decision about a student's assignment to an outcome category might be based on 
the student's responses to several related questions, looking for, in Brown and Burton's 
(1978) terminology, consistent "bugs" in their solutions. The automatic generation of 
hypotheses about students' understandings might be followed by further questions aimed at 
confirming those hypotheses. Does a student who draws trajectory C in Figure 1 also 
believe that an object fired out of a curved tube will continue in a curved path for a short 
time after leaving the tube? Through carefully designed hints and subquestions it may be 
possible to emulate in a crude way the type of exploration that can be done through an 
interview to trace a student's misunderstanding to its source. Ordered outcome categories, 
for example, might then be defined in terms of responses to a set of related questions or 
tasks. 
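
One simple way to base a category assignment on several related questions is sketched below. It assumes that each response has already been matched to a conception label (for example, the drawing labels of Figure 1); the function name, the agreement threshold, and the example responses are hypothetical.

```python
from collections import Counter


def assign_category(per_question_labels, min_agreement=0.6):
    """Assign a student to the conception shown most often across related
    questions, provided it accounts for at least `min_agreement` of them.
    Return None when the responses are too inconsistent, which would signal
    a need for follow-up questions."""
    counts = Counter(per_question_labels)
    label, n = counts.most_common(1)[0]
    if n / len(per_question_labels) >= min_agreement:
        return label
    return None


# Hypothetical student who shows conception "C" on three of four tasks.
consistent = assign_category(["C", "C", "C", "B"])

# Hypothetical student whose responses show no dominant conception.
inconsistent = assign_category(["A", "B", "C", "D"])
```

The threshold of 0.6 is arbitrary; it stands in for whatever criterion of consistency a test developer would adopt.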

In an achievement test of this type, tasks may bear little resemblance to traditional 
achievement test questions. As diSessa (1982) and White (1983) show, a great deal of 
information can be collected about individuals' naive theories of force and motion by 
asking them to move simulated objects on a screen. A computer can be used to keep 
detailed records of when students apply forces, in which directions they apply those forces, 
and how they respond to the motion that they produce. Automatic analyses of student 
records might be used to infer students' levels of understanding. Simulations of this kind 
could be used in a wide variety of learning areas — for example, the use of simulated patient 
management problems to explore students' levels of understanding of medical principles 
and to expose inappropriate or potentially misleading ways of thinking about particular 
phenomena (of course, the analysis of these data would be far more complex than the 
simple examples given here). 

6. CONSTRUCTING MEASURES OF ACHIEVEMENT 

If the types of observations that result from these testing procedures are to provide a 
basis for achievement measurement and are to be a viable alternative to conventional 
achievement tests, then models and methods analogous to those that have been developed 
for right/wrong test questions are required to supervise the construction of the new 
measures. 

The starting point in the development of a method for ordered outcome categories is 
a matrix of observations like the matrix shown in Table 2. This hypothetical data matrix 
shows the responses of 32 students to 8 items (e.g., Carpenter & Moser's single-digit 
addition items). Responses to each item are recorded in one of five ordered categories 
(labelled 0 to 4). Students' scores on each item have been arranged in this matrix in an 
orderly way with abrupt transitions between adjacent categories. (This can be seen by 
reading down each column.) The consequence of ordering scores on each item in this way 
is that it is possible to infer from the full data matrix in Table 2 an unambiguous order for 
these 32 students on the single achievement dimension defined by these eight items. 
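
The sense in which a perfectly orderly matrix implies an unambiguous ordering of students can be checked mechanically. The sketch below does not reproduce Table 2; it uses two small hypothetical matrices (rows are students, columns are items, entries are categories 0 to 4) and tests whether, once students are sorted by total score, the score on every item is non-decreasing down the column.

```python
def is_orderly(matrix):
    """True if sorting students by total score makes every item column
    non-decreasing, so that a student who ranks higher overall is never
    in a lower category on any item."""
    rows = sorted(matrix, key=sum)
    n_items = len(rows[0])
    return all(
        rows[i][j] <= rows[i + 1][j]
        for i in range(len(rows) - 1)
        for j in range(n_items)
    )


# Hypothetical orderly data: five students, three items.
orderly = [
    [0, 0, 0],
    [1, 0, 1],
    [2, 1, 1],
    [3, 2, 2],
    [4, 4, 3],
]

# Hypothetical disorderly data: the higher-scoring student is in a lower
# category on the second item.
disorderly = [
    [0, 3, 0],
    [2, 0, 2],
]

orderly_ok = is_orderly(orderly)
disorderly_ok = is_orderly(disorderly)
```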



Insert Table 2 about here 



It is unlikely that a perfectly orderly pattern of scores on an item will occur in 
practice. The transition from category x-1 to category x of an item is not likely to be sharp, 
as depicted in Table 2, but to be gradual. Rather than expecting a person above a particular 
level of ability in an area of learning to definitely score x rather than x-1 on an item, it is 
more realistic to imagine a score of x becoming more likely than a score of x-1 at higher 
levels of ability. In other words, a probabilistic formulation will in general be more 
appropriate than a deterministic representation (see Wilson, 1989a). 

The psychometric method described here, the Partial Credit Model (PCM; Masters, 
1982; Wright and Masters, 1982), proposes that the probability of a person scoring x rather 
than x-1 on a particular item i will increase steadily with ability in an area of learning such 
that 

    $$\frac{\pi_{nix}}{\pi_{ni,x-1} + \pi_{nix}} = \frac{\exp(\theta_n - \delta_{ix})}{1 + \exp(\theta_n - \delta_{ix})}$$        [1] 

where $\pi_{nix}$ is the probability of person n responding in category x 
(x = 1, 2, ..., $m_i$) of item i, $\theta_n$ is person n's level of proficiency in the area of learning 
measured by this set of items, and $\delta_{ix}$ is a parameter associated with the transition between 
outcome categories x-1 and x of item i. 

The consequence of applying the simple logistic expression [1] to the transition 
between each pair of adjacent outcome categories for each item, is that a connection is 
formed between the ordered categories for that item and the underlying variable that the set 
of items is used to measure. It is this connection that enables performances on each item to 
be used to estimate students' locations on the underlying variable. The nature of this 
probabilistic connection is illustrated in Figure 3, in terms of response probabilities for a 
hypothetical single-digit addition problem. 
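
The connection can be made explicit with a short derivation. It is offered as a sketch: it writes $\theta_n$ for person proficiency and $\delta_{ix}$ for the transition parameters, and uses the usual convention that an empty sum is zero.

```latex
% From [1], the odds of category x against category x-1 are
\[
  \frac{\pi_{nix}}{\pi_{ni,x-1}} = \exp(\theta_n - \delta_{ix}),
  \qquad x = 1, \ldots, m_i .
\]
% Chaining these ratios upward from category 0 gives
\[
  \pi_{nix} = \pi_{ni0} \exp\Big( \sum_{j=1}^{x} (\theta_n - \delta_{ij}) \Big) .
\]
% Requiring the m_i + 1 category probabilities to sum to one then yields
\[
  \pi_{nix} =
  \frac{\exp \sum_{j=1}^{x} (\theta_n - \delta_{ij})}
       {\sum_{k=0}^{m_i} \exp \sum_{j=1}^{k} (\theta_n - \delta_{ij})},
  \qquad x = 0, 1, \ldots, m_i ,
\]
% where the empty sum (x = 0 or k = 0) is taken to be zero.
```

Each category probability therefore depends on the person only through $\theta_n$, which is what allows responses in ordered categories to locate students on the underlying variable.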



Insert Figure 3 about here 



Figure 3 shows how, under the PCM, the probability of a response in each 
category of an item changes with increasing student proficiency. It has been drawn to 
resemble Figure 2. The difference is that Figure 3 does not show observed proportions of 
students in each category, but modelled proportions. For any given level of $\theta$, one looks 
across the graph to determine the probabilities of a response in each category at this level of 
proficiency. The basic shapes of the five zones in Figure 3 are fixed by the PCM and are 
the consequence of using the simple logistic expression [1] to model the transition between 
adjacent categories of each item. The widths and locations of the zones for each item are 
estimated from students' responses to that item, and are expressed through the $\delta$ 
parameters. 
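
The modelled probabilities plotted in a figure of this kind can be computed directly. The sketch below is illustrative only: the function name and the four transition parameters are hypothetical, and it uses the cumulative form of the PCM that results from applying the logistic expression [1] to every pair of adjacent categories.

```python
import math
from typing import List


def pcm_probabilities(theta: float, deltas: List[float]) -> List[float]:
    """Category probabilities for one item under the Partial Credit Model.

    theta  : person proficiency
    deltas : transition parameters for categories 1..m (hypothetical values)
    Returns the probabilities of categories 0..m.
    """
    # Cumulative sums of (theta - delta); the empty sum for category 0 is 0.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating, for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical transition parameters for a five-category item (0 to 4).
deltas = [-1.5, -0.5, 0.5, 1.5]
for theta in (-3.0, 0.0, 3.0):
    probs = pcm_probabilities(theta, deltas)
    print(theta, [round(p, 3) for p in probs])
```

Reading the printed rows from low to high proficiency shows the most probable category moving from 0 toward 4, which is the pattern of zones the figure depicts.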

The probabilistic partial credit model depicted in Figure 3 enables measures of 
achievement to be constructed from inferences of students' levels of understanding of each 
of a number of concepts or phenomena in an area of learning. A student's $\theta$ parameter 
indicates not simply a tendency to make correct responses, but tendencies to provide 
answers reflecting the various levels of understanding on a collection of tasks probing that 
understanding. The model serves the same function in the analysis of responses recorded 
in ordered outcome categories as the item response models that have been developed for 
dichotomously-scored responses (Rasch, 1960; Lord and Novick, 1968; Lord, 1980), 
summarizing, in terms of the task and person parameters, the patterns in the data that are 
consonant with a conception of student proficiency. Estimation procedures and tests of 
model-data fit for the PCM are described by Wright and Masters (1982). Tests of item fit 
(which can be thought of as comparisons of the observed and modelled maps for an item as 
shown in Figures 2 and 3) provide internal consistency indices analogous to traditional item 
statistics like biserial correlations. Tests of person-fit flag occurrences of unusual response 
patterns, as might occur when a student's state of understanding is atypical, and requires 
special attention. 

7. PARTIALLY-ORDERED STATES 

The psychometric model just described can be used when a set of ordered 
categories is defined for each item. However, attempting to order all conceptions of a 
phenomenon from "worst" to "best" may not always be fruitful. In some cases, two or 
more ways of visualizing a problem can be identified, none better or worse than another. If 
these different conceptions have different implications for instruction, then maintaining a 
distinction among them can be useful. 

Norman's (1983) and Gentner and Gentner's (1983) studies of students' models 
for electrical circuits provide an example. These studies suggest that many students 
visualize electric circuits in terms of more familiar physical systems. Some, for example, 
see electric current as analogous to water flow. Batteries are visualized as reservoirs, and 
resistors correspond to constrictions in water flow. This analogy facilitates the solution of 
problems about power sources in parallel and series, but impedes solutions to problems 
about parallel and series resistors. Other students see an electrical power source as 
analogous to a crowd entering a stadium, with resistors as turnstiles through which they 
must pass. This "teeming crowd" analogy facilitates problems about combinations of 
resistors, but offers little insight into battery combinations. 

Each of these models captures some aspects of electrical systems. Students using 
either model have a better understanding than students with no model at all. On the other 
hand, neither of these physical models provides a complete understanding of current flow 
or of the operation of circuits. A higher level of understanding requires an appreciation of 
the limitations of the physical analogies as models for circuits. In this sense, students who 
operate with either one of the two models can be thought of as being at similarly 
intermediate levels of understanding. 

From the point of view of traditional test theory and the maximization of test 
reliability, it is difficult to justify distinguishing among students who use the water flow 
analogy and those who use the teeming crowds analogy. Items that distinguish between 
these two groups are likely to contribute little to reliability, as their discriminating power is 
among people at similar levels of overall proficiency. But further instruction might well 
differ for the two groups: first explicating the model that a student's responses suggest he 
or she may be using (perhaps intuitively), exploring its uses and limitations, then 
introducing the complementary model and its sphere of usefulness. 

To develop a model for these situations, let us suppose that we can identify K states 
of understanding in a learning area, subsets of which may be ordered, but others of which 
may not be. Items are characterized by identifiable features that determine their difficulties 
within these states. In the electrical circuits example, for instance, resistor problems are 
relatively easier than battery problems for students using the teeming crowds analogy, 
while the battery problems are relatively easier for those using the water flow analogy. 
From each student's responses, we wish to infer his or her state of understanding ($\phi_n$, 
which ranges from 1 to K) and degree of proficiency within that state ($\theta_n$). 

The essence of this approach is that while a single proficiency summary of 
performance fails to characterize important differences among learners, it may suffice in 
some applications to use a single proficiency to characterize differences among learners in 
the same type of understanding, while further distinguishing among these qualitative states. 
The fact that these variables can never be known with certainty is reflected by the nature of 
the inferences that are drawn about students: probabilities that the student is in the possible 
states, and an estimate of proficiency corresponding to each possibility. 

The details of such models are given by Mislevy and Verhelst (1990). In the case 
of items scored right or wrong, the probability of a correct response to Item i from Person 
n, who is in state k of understanding ($\phi_n = k$) and has proficiency $\theta_n$, is given as: 

    $$P(x_{ni} = 1 \mid \theta_n, \phi_n = k, \beta_{ik}) = f_k(\theta_n, \beta_{ik}),$$        [2] 

where $\beta_{ik}$ characterizes such features of Item i as its difficulty and $f_k$ is a function relating 
examinee and item parameters to probabilities of correct response, both as they pertain to 
persons in level k only. When persons from only one level are under consideration, [2] is 
a standard IRT model. The item parameters can be expected to vary from one level of 
understanding to the next, however, and indeed they must vary if the model is to be 
practically useful for distinguishing students at one level from those at another. 
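
How such a model supports inference about a student's state can be sketched in a few lines. Everything specific here is an assumption made for illustration rather than the procedure of Mislevy and Verhelst (1990): the function names, the logistic form chosen for the response function in every state, the standard normal weights used to average over proficiency, the equal prior probabilities for the two states, and the item difficulties.

```python
import math
from typing import Dict, List


def rasch(theta: float, beta: float) -> float:
    """Logistic response function, used here as an assumed form for each state."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def state_posterior(responses: List[int],
                    betas_by_state: Dict[int, List[float]],
                    state_prior: Dict[int, float]) -> Dict[int, float]:
    """Posterior probability of each state given right/wrong responses.

    Proficiency is averaged out over a coarse grid with standard normal
    weights (an illustrative assumption).
    """
    grid = [-4.0 + 0.1 * g for g in range(81)]
    weights = [math.exp(-0.5 * t * t) for t in grid]
    norm = sum(weights)
    marginal = {}
    for k, betas in betas_by_state.items():
        total = 0.0
        for t, w in zip(grid, weights):
            likelihood = 1.0
            for x, b in zip(responses, betas):
                p = rasch(t, b)
                likelihood *= p if x == 1 else 1.0 - p
            total += likelihood * w / norm
        marginal[k] = state_prior[k] * total
    evidence = sum(marginal.values())
    return {k: v / evidence for k, v in marginal.items()}


# Hypothetical difficulties: items 0-1 are resistor problems, items 2-3 battery problems.
betas_by_state = {
    1: [1.0, 1.0, -1.0, -1.0],   # state 1 (water flow): battery problems easier
    2: [-1.0, -1.0, 1.0, 1.0],   # state 2 (teeming crowd): resistor problems easier
}
prior = {1: 0.5, 2: 0.5}

# A student who solves both resistor problems and misses both battery problems.
posterior = state_posterior([1, 1, 0, 0], betas_by_state, prior)
print({k: round(v, 3) for k, v in posterior.items()})
```

The output is a probability for each state rather than a single classification, in keeping with the point that a student's state can never be known with certainty.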

To illustrate the approach, we present highlights of one of many aspects of an 
analysis carried out by Wilson (1984), using Robert Siegler's (1978, 1981) data and rule- 
acquisition perspective. For additional examples, the reader is referred to Mislevy and 
Verhelst (1990), Mislevy, Wingersky, Irvine, and Dann (in press), and Wilson (1989b). 

Figure 4 shows two of Siegler's six balance beam problem prototypes. In E 
("Equal") items, both the weights and distances are the same on the two sides of the scale, 
and the correct answer is that the beam will balance. In S ("Subordinate") items, the same 
numbers of weights are on both sides, but on one side they are further from the fulcrum. 
That side will tip down. Following Piaget (Inhelder & Piaget, 1958; Piaget, 1960), Siegler 
posits that children typically exhibit distinct stages as they acquire competence in 
proportional reasoning, adding to their repertoire the increasingly sophisticated rules listed 
in Table 3. Children can thus differ as to their stage of understanding, or their proficiency 
in using the rules they currently command. In particular, a qualitative shift occurs when a 
child apprehends the salience of distance in balance beam problems. Before this 
realization, children see no systematic, relevant differences between E and S items, and 
tend to predict the beam will balance in both situations. 



Insert Table 3 and Figure 4 about here 



Among other analyses, Wilson (1984) analyzed responses to four E and four S 
items from two perspectives. The first was based on the Rasch IRT model for right/wrong 
items. Under the Rasch model, the probability that Person n will respond correctly to Item 
i is a function of the person's proficiency parameter, $\theta_n$, and the item's difficulty 
parameter, $\beta_i$: 

    $$P(x_{ni} = 1 \mid \theta_n, \beta_i) = \frac{\exp(\theta_n - \beta_i)}{1 + \exp(\theta_n - \beta_i)}$$        [3] 

(Note the similarity of [3] to [1]; the Rasch model for right/wrong items is a special case of 
the PCM.) Figure 5 illustrates the results. The relative positions of an item and a person 
on the scale ($\theta_n - \beta_i$) determine the probability of a correct response through [3]. Not 
surprisingly, S items are seen to be harder than E items. If the Rasch model were correct, 
increasing competence would be reflected in similar increases in the chances of correct 
response to both E and S items. But analyses of person-fit to the Rasch model revealed 
relatively fewer correct answers to S items from many children who did well on E items, 
and relatively fewer incorrect answers to E items from children who did well on S items, 
than would be expected under the Rasch model. 
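
The sense in which the Rasch model implies "similar increases" for both item types can be seen numerically. In the sketch below the function names and the two difficulties are hypothetical; it shows only that, on the logit scale, the gap between an E item and an S item is the same at every level of proficiency.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct response under the Rasch model:
    exp(theta - beta) / (1 + exp(theta - beta))."""
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


# Hypothetical difficulties: an easier E item and a harder S item.
beta_e, beta_s = -1.0, 1.0

for theta in (-2.0, 0.0, 2.0):
    p_e = rasch_probability(theta, beta_e)
    p_s = rasch_probability(theta, beta_s)
    # The logit gap equals beta_s - beta_e regardless of theta.
    gap = logit(p_e) - logit(p_s)
    print(theta, round(p_e, 3), round(p_s, 3), round(gap, 3))
```

A group of children for whom S items are disproportionately hard relative to E items, or disproportionately easy, cannot be reproduced by any choice of proficiency under this model, which is what the person-fit analyses detected.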

Wilson resolved these anomalies in the second analysis, based on his "Saltus" 
(Latin for "leap") model for development that occurs in stages. Saltus extends the Rasch 
model by incorporating stage membership parameters for persons and "Saltus parameters" 
that allow for discontinuities such as the transition from Rule I to Rule II. In this analysis, 
children who had not experienced the transition were modeled in accordance with [3]; those 
who had were modeled by a model of the same form, but with the Saltus parameter $\tau$ 
subtracted from the difficulty parameters of S items. In terms of Equation [2], $f_I$ and $f_{II}$ 
both have the functional form given in [3], $\beta_{i,II} = \beta_{i,I}$ for E items, and $\beta_{i,II} = \beta_{i,I} - \tau$ for S items. 
Figure 6 illustrates the effect: $\tau$ measures the quantitative effect on performance 
associated with a qualitative change in understanding. 
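
The shift described above is simple to express in code. This is a minimal sketch under stated assumptions, not Wilson's estimation procedure: the function names, the difficulty of 1.0 for the S item, and the value 2.0 for the Saltus parameter are all hypothetical, and stage membership is treated as known rather than estimated.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))


def saltus_probability(theta: float, beta: float, is_s_item: bool,
                       past_transition: bool, tau: float) -> float:
    """Rasch probability with the difficulty of S items lowered by tau
    for children who have made the transition; E items are unchanged."""
    shift = tau if (is_s_item and past_transition) else 0.0
    return rasch_probability(theta, beta - shift)


# Hypothetical values: an S item of difficulty 1.0 and a Saltus parameter of 2.0.
theta, beta_s, tau = 0.0, 1.0, 2.0
before = saltus_probability(theta, beta_s, True, False, tau)
after = saltus_probability(theta, beta_s, True, True, tau)
print(round(before, 3), round(after, 3))
```

Two children with the same proficiency can thus have quite different chances of success on an S item, depending only on which side of the transition they fall.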




Insert Figures 5 and 6 about here 



8. OTHER APPROACHES 

The psychometric literature has begun to offer models that might be used to guide 
the construction and analysis of achievement tests of the kind proposed here. Some are 
mentioned below. 

Wilson's (1984, 1989b) Saltus model for hierarchical stages of development 
(illustrated above) provides a stochastic framework for psychological models such as 
Piaget's (1960) and Siegler's (1978, 1981) that posit predictable discontinuities in 
proficiencies as development occurs, and educational models such as Gagné's (1968) and 
Riley's (Riley, 1981; Riley, Greeno, & Heller, 1983) that posit detectable patterns of task 
difficulties as students progress through successive levels of competence. 

Latent class models (e.g., Haertel, 1984, 1989; Haertel & Wiley, in press; 
Macready and Dayton, 1980) accommodate nonordered states of competence and 
reconfigurations of proficiencies, without further differentiating students within a state. 
Computational limitations to fewer than about ten items per student have all but precluded 
their use for measuring individual achievement. Recent developments by Paulson (1985) 
and Yamamoto (1987) enable the use of these models with up to sixty items, opening the 
door to precise estimation for individual students and even potentially adaptive testing 
(Macready & Dayton, 1989). 

Yamamoto (Yamamoto, 1987; Yamamoto & Gitomer, in press) has also introduced 
a "hybrid" model for a mixture of latent classes and an IRT class. No claim is made that 
such a mixture accurately reflects the psychological reality of students' behavior, but a 
practical advantage is emphasized: explicit classes can be defined to correspond to 
available instructional options while an amorphous IRT class accounts for potentially large 
numbers of remaining classes, distinctions among which are irrelevant to the decision that 
must be made. 

Another approach that leans on IRT to handle bookkeeping tasks in complex 
problems is Kikumi Tatsuoka's (K.K. Tatsuoka, 1983, 1989; K.K. Tatsuoka & M.M. 
Tatsuoka, 1987) "rule space" procedure. A standard IRT model is first fit to item 
responses. If the IRT model were correct, estimates of persons' proficiency would account 
for all systematic patterns within the data. But Tatsuoka then calculates an index of lack of 
fit from the IRT model, and studies the joint distribution of proficiency estimates under the 
IRT model and indices of lack of fit from that model. The ordered pairs of proficiency 
estimates and fit indices often suffice to identify systematic patterns of response that 
correspond to particular solution strategies, thereby identifying users of particular 
erroneous rules as well as correct rules. 

Embretson's (1985, in press) model for multiple strategies requires identifying 
different sequences of component subtasks that can be used to solve problems. This 
approach can be applied when it is possible to observe the results of subtask operations as 
well as a global result, and, as such, is amenable to procedures described above which 
enable the definition of levels of understanding for identified composite tasks. If levels of 
understanding are ordered, the results of microanalyses using Embretson's model could 
serve as input to achievement measurement via the partial credit model. 

Our discussions and examples have addressed relatively simple situations, with a 
single developing concept with just a few stages. As such, however, they constitute 
building blocks for characterizing students' knowledge with respect to larger systems of 
interconnected concepts. The interested reader is referred to Mislevy, Yamamoto, and 
Anacker (in press) on the possibility of constructing Bayesian inference networks for this 
purpose. 

9. CONCLUSION 

Recent developments in cognitive and educational psychology reveal that most 
meaningful learning contrasts markedly with the type of learning implied by standard 
psychometric procedures, both those based on item response theory and those using 
classical true-score test theory. The difference is characterized by the discontinuities of 
real-world learning, as learners reconfigure their knowledge, combine existing skills in 
new ways, and develop alternative strategies for solving problems. 

It is possible to build achievement tests that measure learning of this kind. It is not 
possible to do so with traditional item writing rules, test construction procedures, and 
scoring formulas. To operationalize the new approach, the structure of learning must be integral 
at each step along the way, from writing items through reporting achievement. In return 
for this greater investment in the psychology of the learning area, one can expect a greater 
utility: a measure of achievement which, by reflecting the nature of competence as attained 
thus far, sets the stage for further learning. 

Table 1 

Outcome Categories for Single-Digit Addition 
(e.g., 6+8 = ?) 



Category   Description 

4          Does not need to count objects, but uses number facts to 
           solve 6+8 = 14. 

3          Always counts on from the larger number ("8; 9, 10, ..., 14"). 

2          Counts on, but not consistently from the larger number 
           ("6; 7, 8, ..., 14"). 

1          Counts out 6 objects and 8 objects and then counts them all 
           ("1, 2, ..., 14"). 

0          Unable to solve. 



Table 2 

Hypothetical Data Matrix for Single-digit Addition 

Note: Table entries are observed outcome categories, coded from 0 to 4. 

Table 3 

Hierarchy of Rule Acquisition 



Rule       Description 

Rule 0     Salience of neither weight nor distance recognized; answers depend on 
           personal factors. 

Rule I     If the weights on both sides are equal, it will balance. If they are not 
           equal, the side with the heavier weight will go down. (Weight is the 
           "dominant dimension," because children are generally aware that weight 
           is important in the problem earlier than they realize that distance from the 
           fulcrum, the "subordinate dimension," also matters.) 

Rule II    If the weights and distances on both sides are equal, then the beam will 
           balance. If the weights are equal but the distances are not, the side with 
           the longer distance will go down. Otherwise, the side with the heavier 
           weight will go down. (A child using this rule uses the subordinate 
           dimension only when information from the dominant dimension is 
           equivocal.) 

Rule III   Same as Rule II, except that if the values of both weight and length are 
           unequal on both sides, the child will "muddle through" (Siegler, 1981, 
           p. 6). (A child using this rule now knows that both dimensions matter, 
           but doesn't know just how they combine. Responses may be based on a 
           strategy such as guessing.) 

Rule IV    Combine weights and lengths correctly (i.e., compare torques, or 
           products of weights and distances). 

Figure 1 

Common Responses to a Physics Task 

Figure 2 

Observed Proportions of Children in Each of Five Ordered 
Outcome Categories on a Single-digit Addition Item 

Figure 3 

Modeled Probabilities of Responding in Each of Five Ordered 
Outcome Categories on a Single-digit Addition Item 
(vertical axis: Model Probability) 

Figure 4 

Prototypical Balance Beam Items 
(item prompt: "Will the beam tip left, tip right, or stay flat 
when the gray blocks are taken away?") 

Saltus Model Representation 
of Balance Beam Items 